Abstract - Routing protocols using the Distributed Bellman-Ford (DBF)
algorithm converge very slowly to the correct routes when link costs
increase, and when a set of link failures results in a network
partition, DBF simply fails to converge, a problem commonly referred to
as the count-to-infinity problem. In this paper, we present MDVA, the
first distance-vector routing algorithm that uses a set of loop-free
invariants to prevent the count-to-infinity problem. In addition, MDVA
computes multipaths that are loop-free at every instant. In our earlier
work we showed how such loop-free multipaths can be used for traffic
load-balancing and for minimizing delays, which are otherwise impossible
to achieve with current single-path routing algorithms [15].

I.  Introduction 

Routing protocols construct tables at each node that specify,
for each destination, the next hop to use for data packet forwarding.
The routing tables they compute must be free of loops when the
network is stable. In dynamic environments, a more stringent
requirement is that the routing tables be loop-free not only when
the network is stable but at every instant, because loops, even if
temporary, can rapidly degrade performance. In our recent work [15],
we described a load-balancing routing framework to obtain
"near-optimal" delays, a key component of which is a fast, responsive
routing protocol that determines multiple successor choices for
packet forwarding such that the routing graphs implied by the routing
tables are free of loops even during network transitions. By
load-balancing traffic over these multiple next-hop choices,
congestion and delays can be significantly reduced. Our goal in this
paper is therefore to develop a distance-vector routing algorithm
that is suitable for implementing near-optimal routing as described
in [15].

Though routing is a very old problem in computer networks,
most of the solutions to date are unsuitable for load-balancing
and for implementing the near-optimal framework mentioned above.
The widely deployed routing protocol RIP provides only one next-hop
choice for each destination and does not prevent temporary loops
from forming. Cisco's EIGRP [1] ensures instantaneous loop-freedom
but can provide only a single loop-free path to each destination at
any given router. The link-state protocol OSPF offers a router
multiple choices for packet forwarding only when those choices offer
the minimum distance. When the link-cost metric has fine granularity,
as in the case of optimal routing, multiple paths of equal distance
are less likely to exist between each source-destination pair, which
means the full connectivity of the network is still not used for
load-balancing. Also, OSPF and other algorithms based on topology
broadcast (e.g., [13], [10]) incur too much communication overhead
when link costs change frequently. Nor do they provide instantaneous
loop-freedom, which is desirable especially when on-line link-cost
measurement is used.

Several routing algorithms based on distance vectors have
been proposed in the literature ([7], [8], [9], [11], [16], to name
a few). However, with the exception of DASM [16], all of
them are single-path algorithms. A few routing algorithms
have been proposed that use partial topology information (see
[6], [12] and the references therein) to eliminate the main
limitation of topology-broadcast algorithms; however, these
algorithms are not loop-free at every instant. Recently, we
introduced MPDA [15], which is the first routing algorithm based
on link-state information that constructs multipaths to each
destination that are loop-free at every instant. In this paper, we
present a new routing algorithm, MDVA (Multipath Distance-Vector
Algorithm), which is the first distance-vector algorithm
that uses the loop-free invariants introduced in [15], solves the
count-to-infinity problem, and computes multipaths to destinations.
We provide formal proofs of the safety and liveness properties
of MDVA, and compare its performance to other routing
algorithms through simulations.

The paper is organized as follows. In Section II we discuss
the main convergence problem facing a typical distance-vector
algorithm and outline a solution that addresses it.
Section III describes MDVA and Section IV provides the
correctness proof for the algorithm. A performance comparison
through simulations is provided in Section V. Section VI
concludes the paper.

II.  Overview  of  the  Approach 
A.  Problem  Formulation 

A computer network is modelled as a graph $G = (N, L)$,
where $N$ is the set of nodes (routers) and $L$ is the set of edges
(links). Let $N^i$ be the set of neighbors of node $i$. The problem
consists of finding the successor set at each router $i$ for each
destination $j$, denoted by $S_j^i \subseteq N^i$, so that when router
$i$ receives a packet for destination $j$, it can forward it to one of
the neighbor routers in the successor set $S_j^i$. By repeating this
process at every router, the packet is expected to reach the destination.
If the routing graph $SG_j$, a directed subgraph of $G$, is defined
by the directed link set $\{(m, n) \mid n \in S_j^m, m \in N\}$, a packet
destined for $j$ follows a path in $SG_j$. Two properties determine
the efficiency of the routing graph constructed by the protocol:
loop-freedom and connectivity. It is required that $SG_j$ be free of
loops, at least when the network is stable, because routing loops
degrade network performance. In a dynamic environment, it is
desirable that $SG_j$ be loop-free at every instant, i.e., if $S_j^i$ and
$SG_j$ are parameterized by time $t$, then $SG_j(t)$ should be free of
loops at any time $t$. Observe that if there is at most one element
in each $S_j^i$, then $SG_j$ is a tree and there is only one path from


TABLE I
Notation

Symbol                  Meaning
$N$                     Set of nodes in the network
$N^i$                   Set of neighbors of node $i$
$S_j^i$                 Next-hop choices at $i$ for destination $j$
$SG_j$                  Routing graph implied by the $S_j^i$ of destination $j$
$D_j^i$                 Distance of node $i$ to $j$ as known to $i$
$l_k^i$                 Cost of link $(i, k)$
$D_{jk}^i$              Distance of node $k$ to $j$ as reported by $k$ to $i$
$FD_j^i$                Feasible distance; an estimate of $D_j^i$
$RD_j^i$                Distance to $j$ reported by node $i$ to its neighbors
$SD_j^i$                Best distance to $j$ through $S_j^i$
$WN_j^i$                Set of neighbors that are waiting for replies
$G(t)$                  A bird's-eye view of the network at time $t$
$\mathbf{D}_j^i(t)$     Distance of node $i$ to $j$ in $G(t)$
$\mathbf{l}_k^i(t)$     Cost of link $(i, k)$ in $G(t)$

any node to $j$. On the other hand, if each $S_j^i$ has more than one
element, then $SG_j$ is a directed acyclic graph (DAG) with
greater connectivity than a simple tree, which enables
traffic load balancing.
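
The loop-freedom requirement on the routing graph can be checked
mechanically. The following is a minimal Python sketch, not part of
MDVA itself: it assumes the successor sets for one destination are
given as a dictionary `succ` mapping each node to its successor set
(a hypothetical representation chosen for illustration), builds the
directed link set of the routing graph, and tests it for cycles with a
depth-first search.

```python
def routing_graph(succ):
    """Directed link set {(m, n) | n in S_j^m} for one destination j."""
    return {(m, n) for m, nbrs in succ.items() for n in nbrs}


def is_loop_free(succ):
    """Return True if the routing graph implied by `succ` is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on DFS stack, finished
    color = {}

    def visit(u):
        color[u] = GREY
        for v in succ.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY:             # back edge: a routing loop exists
                return False
            if c == WHITE and not visit(v):
                return False
        color[u] = BLACK
        return True

    return all(color.get(u, WHITE) != WHITE or visit(u) for u in list(succ))


# Hypothetical successor sets for destination "j".
tree = {"a": {"b"}, "b": {"j"}, "c": {"j"}, "j": set()}       # single path
dag = {"a": {"b", "c"}, "b": {"j"}, "c": {"j"}, "j": set()}   # multipath
loop = {"a": {"b"}, "b": {"a"}, "j": set()}                   # a <-> b

print(is_loop_free(tree), is_loop_free(dag), is_loop_free(loop))
```

In the multipath case node `a` has two next-hop choices, which is the
extra connectivity that load balancing relies on.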

B.  Solution  Strategy 

Given that there are potentially many directed acyclic graphs
for a given destination in a graph, a question arises as to which
DAG should be used as the routing graph. The routing graph should
be uniquely defined and easily computable by a distributed
algorithm. The natural choice for the routing graph is the
one defined by the shortest paths. Accordingly, MDVA defines
$S_j^i(t) = \{k \mid D_j^k(t) < D_j^i(t), k \in N^i\}$, where $D_j^i$ is
the cost of the shortest path from $i$ to $j$ measured as the sum of the
costs of the links on the path. The routing graph $SG_j$ implied by this
set is unique and is called the shortest multipath. To compute
$D_j^i$, distributed routing algorithms may exchange any information
(distance vectors or link states), as long as they ensure that the
$D_j^i$'s converge to the correct distances. Convergence is formally
defined as follows. At time $t$, let $G(t)$ denote the topology of
the network as seen by an "omniscient observer", and let
$\mathbf{D}_j^i(t)$ denote the distance of $i$ to $j$ in $G(t)$. (Note
that we use bold font to refer to quantities in $G$.) Assume the network
has a stable configuration up to time $t$. We say that the network has
converged to the correct values at $t$ if $D_j^i(t) = \mathbf{D}_j^i(t)$
for all $i$ and $j$. Now, if a sequence of link-cost changes occurs
between $t$ and $t_c$ and none after $t_c$, then the routing algorithm is
said to converge if at some time $t_f$, it satisfies
$t_c \le t_f < \infty$ and $D_j^i(t_f) = \mathbf{D}_j^i(t_c)$ for all $i$
and $j$. In addition, during the convergence phase, the algorithm must
ensure that the $SG_j$'s are loop-free at every instant. Note the
distinction between $D_j^i$ and $\mathbf{D}_j^i$: $\mathbf{D}_j^i$ is
the correct distance, whereas $D_j^i$ is a local variable of node $i$ and
is an "estimate" of $\mathbf{D}_j^i$. $D_j^i$ must eventually equal
$\mathbf{D}_j^i$ if $\mathbf{D}_j^i$ does not change further.
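
As an illustration of the shortest-multipath definition, the following
Python sketch computes the correct distances to one destination with
Dijkstra's algorithm on a global view of the graph and then derives each
successor set as the neighbors that are strictly closer to the
destination. It is a centralized sketch under stated assumptions, not
the distributed computation MDVA performs; the function names, the
undirected `cost` dictionary and the example topology are hypothetical.

```python
import heapq
import math


def distances_to(dest, cost):
    """Dijkstra toward `dest`; cost[(u, v)] is the cost of undirected link (u, v)."""
    adj = {}
    for (u, v), c in cost.items():
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    dist = {n: math.inf for n in adj}
    dist[dest] = 0
    heap = [(0, dest)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:               # stale heap entry
            continue
        for v, c in adj[u].items():
            if d + c < dist[v]:
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return dist, adj


def shortest_multipath(dest, cost):
    """S^i = {k in N^i | D^k < D^i} for every node i, for destination `dest`."""
    dist, adj = distances_to(dest, cost)
    return {i: {k for k in adj[i] if dist[k] < dist[i]} for i in adj}


# Hypothetical five-link topology with destination "j".
cost = {("a", "b"): 1, ("a", "c"): 2, ("b", "c"): 1, ("b", "j"): 3, ("c", "j"): 1}
S = shortest_multipath("j", cost)
print(S)
```

Here node `b` keeps `j` as a successor even though the direct link is
not on its shortest path, because `j` is strictly closer to the
destination than `b` is; distances strictly decrease along every link of
the resulting graph, so it cannot contain a loop.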

In the Distributed Bellman-Ford (DBF) algorithm, each node $i$
repeatedly executes the equation
$D_j^i = \min\{D_{jk}^i + l_k^i \mid k \in N^i\}$
for each destination $j$, and every time $D_j^i$ changes it reports the
new value to all its neighbors. A known property of DBF is that it
always converges, and does so quickly, when distances to destinations
decrease [8]. However, convergence is slow if link costs increase,
and in the extreme case when link failures result in network
partitions, DBF never converges. This is the well-known
count-to-infinity problem [14]. Intuitively, the count-to-infinity
problem results from "circular" computation of distances; that is, a
node computes its distance to a destination using a distance
communicated by a neighbor that happens to be the length of a
path passing through the node itself. The node using such
a distance is unaware of this because nodes exchange only distance
information, with no path information.
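
The count-to-infinity behavior is easy to reproduce. The sketch below is
a simplified synchronous model of DBF written for illustration only: in
each round every node recomputes its distance from the values its
neighbors held in the previous round. The three-node line topology
A - B - C, the unit link costs and the helper name `dbf_round` are
assumptions of the example. After the link B - C fails, the destination
C is unreachable, yet A and B keep raising their distances using each
other's stale values.

```python
import math


def dbf_round(dist, links, dest):
    """One synchronous DBF step: D^i = min over neighbors k of (D^k + l_k^i)."""
    new = {}
    for i in dist:
        if i == dest:
            new[i] = 0
        else:
            new[i] = min(
                (dist[k] + c for (a, k), c in links.items() if a == i),
                default=math.inf,
            )
    return new


# Converged distances to C on the line A - B - C with unit costs.
dist = {"A": 2, "B": 1, "C": 0}
# Link B - C has failed; only A - B remains (both directions).
links = {("A", "B"): 1, ("B", "A"): 1}

history = [dist]
for _ in range(4):
    history.append(dbf_round(history[-1], links, "C"))

for step, d in enumerate(history):
    print(step, d["A"], d["B"])
# After 4 rounds: A = 6, B = 5, and the values keep growing without bound.
```

The correct distance from A and B to C is infinity after the failure,
but each round only adds a bounded increment, so the estimates never
reach it.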

Circular computation of distances of the kind that occurs in DBF can
be prevented if distance information is propagated along a DAG
rooted at the destination. The key idea is that, given a DAG,
each node computes its distance using distances reported by
the "downstream" nodes and reports its distance to "upstream"
nodes. This method, called diffusing computation, was first
suggested by Dijkstra et al. [4] to ensure termination of distributed
computation; a diffusing computation always terminates due to
the acyclic ordering of the nodes. DUAL [5], the algorithm on
which EIGRP [1] is based, uses diffusing computations to solve
the count-to-infinity problem. Several other distance-vector
algorithms have been proposed that use diffusing computations to
overcome the count-to-infinity problem of DBF [8], [9], [16].
The algorithm suggested by Jaffe and Moss [8] allows nodes to
participate in multiple diffusing computations for the same
destination and requires the use of unbounded counters, for which
reason it may not be practical. In contrast, a node in DUAL and
DASM [16] participates in only one diffusing computation for
any destination at any one time and thus requires only a toggle
bit. MDVA, presented here, follows the latter approach.

Two questions arise regarding a diffusing computation:

1. Because there are potentially many DAGs for a given destination,
   which one should be used for the diffusing computation?

2. How should a diffusing computation be performed in a dynamic
   environment in which the chosen DAG changes with time?

The answer to the first question is straightforward: the shortest
multipath $SG_j$ is the right choice given that computing $SG_j$
is our final goal. The second question is not as straightforward.
An $SG_j$ used for carrying out a diffusing computation can be
allowed to change if the following conditions hold: (1) $SG_j$ is
acyclic at every instant, and (2) at any instant, if a node $i$ reports
a distance through a neighbor $k$ in $S_j^i$, it must ensure that $k$
remains in $S_j^i$ until the end of the diffusing computation. That
these conditions prevent circular computation of distances can
be inferred from the following argument. Assume that a circular
computation occurs at time $t$ involving nodes $i_0, i_1, \ldots, i_m$.
Let a node $i_p$, $1 \le p \le m$, compute its distance at $t_p \le t$
using the distance reported by $i_{p-1}$, and let $i_0$ compute its
distance using the distance reported by $i_m$ at $t_0$. Because
$i_{p-1}$ is held in the successor set of $i_p$ for $1 \le p \le m$ and
$i_0$ holds $i_m$ until the diffusing computation ends, we have

    $i_0 \in S_j^{i_1}(t_1) \Rightarrow i_0 \in S_j^{i_1}(t)$
    $i_1 \in S_j^{i_2}(t_2) \Rightarrow i_1 \in S_j^{i_2}(t)$
    $\ldots$
    $i_{m-1} \in S_j^{i_m}(t_m) \Rightarrow i_{m-1} \in S_j^{i_m}(t)$
    $i_m \in S_j^{i_0}(t_0) \Rightarrow i_m \in S_j^{i_0}(t)$

Because the $SG_j(t)$ implied by the $S_j^i(t)$ is acyclic at every
instant $t$, the above relations indicate a contradiction. Therefore,
circular computation is impossible if the above-mentioned
conditions are enforced. Notice that we intend to propagate the
distances along the shortest multipath $SG_j$, which itself is
computed using the distances. This "bootstrap" approach, computing
$D_j^i$ using diffusing computation along $SG_j$ and simultaneously
constructing and maintaining $SG_j$, is central to MDVA.

How can we ensure that $SG_j$ is always loop-free? To do this
we use a new variable $FD_j^i$, called the feasible distance, which
is an "estimate" of the distance $D_j^i$ in that $FD_j^i$ is equal to
$D_j^i$ when the network is in a stable state, but to prevent loops
during periods of network transition, it is allowed to differ
temporarily from $D_j^i$. Let $D_{jk}^i$ be the distance of $k$ to $j$
as notified to $i$ by $k$. To ensure loop-freedom at every instant,
$FD_j^i$, $D_{jk}^i$ and $S_j^i$ must satisfy the Loop-Free Invariant
(LFI) conditions introduced in [15]. The LFI conditions capture all
previous loop-free conditions ([5], [16]) in a unified form that
simplifies protocol design. The conditions are

Loop-Free Invariant (LFI) Conditions [15]:

    $FD_j^i(t) \le D_{ji}^k(t), \quad k \in N^i$                    (1)

    $S_j^i(t) = \{k \mid D_{jk}^i(t) < FD_j^i(t)\}$                 (2)

The invariant conditions (1) and (2) state that, for each
destination $j$, a node $i$ can choose a successor whose distance to
$j$, as known to $i$, is less than the distance of node $i$ to $j$ that
is known to its neighbors. Theorem 1 is reproduced here for convenience.

Theorem 1: [15] If the LFI conditions are satisfied at any
time $t$, then the $SG_j(t)$ implied by the successor sets $S_j^i(t)$ is
loop-free.

For a proof of this theorem the reader is referred to [15]. The
above theorem suggests that any distributed routing protocol
(link-state or distance-vector) attempting to find loop-free shortest
multipaths must compute $D_j^i$, $FD_j^i$ and $S_j^i$ such that the
LFI conditions are satisfied, and such that at convergence
$D_j^i = FD_j^i = \mathbf{D}_j^i$ = minimum distance from $i$ to $j$.
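
The two LFI conditions can be stated as small predicates over the
per-node tables. The Python sketch below is illustrative only: it
assumes a hypothetical layout in which `FD[i][j]` holds the feasible
distance of node i for destination j and `nbr[i][k][j]` holds the
distance of neighbor k to j as recorded at node i. The second data set
shows what goes wrong when condition (1) is violated: node `b` uses a
feasible distance larger than the value its neighbor `a` has on record,
and the successor sets then form a two-node loop.

```python
def successor_set(i, j, FD, nbr):
    """Condition (2): neighbors k whose recorded distance is below FD of i."""
    return {k for k, table in nbr[i].items() if table[j] < FD[i][j]}


def lfi_condition_1(i, j, FD, nbr):
    """Condition (1): FD of i is no larger than what each neighbor recorded for i."""
    return all(FD[i][j] <= nbr[k][i][j] for k in nbr[i])


# Hypothetical line topology a - b - j; nbr[i][k]["j"] is k's distance as known to i.
nbr = {
    "a": {"b": {"j": 1}},
    "b": {"a": {"j": 3}, "j": {"j": 0}},
    "j": {"b": {"j": 1}},
}
FD_ok = {"a": {"j": 3}, "b": {"j": 1}, "j": {"j": 0}}
# b raises its feasible distance to 5 while a still has b's old distance 1.
FD_bad = {"a": {"j": 3}, "b": {"j": 5}, "j": {"j": 0}}

for name, FD in (("ok", FD_ok), ("bad", FD_bad)):
    holds = all(lfi_condition_1(i, "j", FD, nbr) for i in nbr)
    succ = {i: successor_set(i, "j", FD, nbr) for i in nbr}
    print(name, holds, succ)
```

With `FD_ok` the successor sets are `a -> b -> j`. With `FD_bad` node
`b` also admits `a` as a successor while `a` still points at `b`, which
is the kind of loop condition (1) is there to exclude.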

III.  Multipath  Distance-Vector  Algorithm 

In essence, MDVA uses DBF to compute $D_j^i$ and therefore
$SG_j$, while always propagating distances along $SG_j$ to
prevent the count-to-infinity problem and ensure termination. Each
node maintains a main table that stores $D_j^i$, the successor set
$S_j^i$, the feasible distance $FD_j^i$, the reported distance
$RD_j^i$, and $SD_j^i$, which is the shortest distance possible through
the successor set $S_j^i$. The table also stores
$WN_j^i \subseteq S_j^i$, the set of waiting neighbors in a diffusing
computation. Each node also maintains a neighbor table for each
neighbor $k$ that contains $D_{jk}^i$, the distance of neighbor $k$ to
$j$ as communicated by $k$. The link table stores $l_k^i$, the cost of
the adjacent link to each neighbor $k$. At startup time, a node
initializes all distances in its tables to infinity and all sets to
null. If a link is down, its cost is considered to be infinity.
The distances to unreachable nodes are considered to be infinity.

Nodes executing MDVA exchange information using messages,
each of which can have one or more entries. An entry, or distance
vector, is of the form [type, j, d], where d is the distance of the
node sending the message to destination j and the type is one of
QUERY, UPDATE and REPLY. We assume that messages transmitted
over an operational link are received without errors and
in the proper sequence, and are processed in the order received.

Nodes invoke the procedure ProcessDistVect shown in
Figure 1 to process distance vectors. An event is the arrival
of a message, a change in the cost of an adjacent link, or a
change in status (up/down) of an adjacent link. When an
adjacent link becomes available, the node sends an update
message [UPDATE, j, $RD_j^i$] for each destination $j$ over the link.
When an adjacent link $(i, m)$ fails, the neighbor table
associated with neighbor $m$ is cleared and the cost of the link is
set to infinity, after which, for each destination $j$, the procedure
ProcessDistVect(UPDATE, m, $\infty$, j) is invoked. Similarly,
when the cost of an adjacent link to $m$ changes, $l_m^i$ is set to the
new cost and ProcessDistVect(UPDATE, m, $D_{jm}^i$, j) is invoked
for each destination $j$. When a message is received from
neighbor $k$, ProcessDistVect(type, k, d, j) is invoked for each
entry [type, j, d] of the message.

Computing distances to each destination can be performed
independently. Hence, in the rest of the description, the working
of the algorithm is described with respect to a particular
destination $j$. A node can be in ACTIVE or PASSIVE state with respect
to a destination $j$; the state is represented by the variable
$state_j^i$. A node is in ACTIVE state when it is engaged in a diffusing
computation and waiting for replies from neighbors. Initially, we
assume that all nodes are in PASSIVE state. As long as link costs
decrease, MDVA works identically to DBF and the nodes remain in
PASSIVE state. This is because the condition on line 9 always
fails and lines 17-24 are always executed. ProcessDistVect
works in such a way that, when in PASSIVE state, the condition
$D_j^i = FD_j^i = RD_j^i = \min\{D_{jk}^i + l_k^i \mid k \in N^i\}$
always holds, which can be inferred from lines 8 and 23. However, if
the distance to a destination increases, either because an adjacent
link cost changed or because a message is received from a neighbor, the
condition on line 9 succeeds and the node engages in a diffusing
computation. A diffusing computation is initiated by sending
query messages to all the neighbors with the best distance $SD_j^i$
through $S_j^i$, and waiting for the neighbors to reply (lines 14-15).
If the increase in distance is due to a query from a successor,
the neighbor is added to $WN_j^i$ to indicate that it is waiting for a
reply, so that a reply can be given when the node transits to
PASSIVE state (lines 11-12). When all replies are received, the node
can be sure that the neighbors have incorporated the distances
that the node reported, and it is safe to transit to PASSIVE state.
At this point, $FD_j^i$ can be increased and new neighbors can be
added to $S_j^i$ without violating the LFI conditions.
When in ACTIVE state, if a query message is received from
a neighbor not in $S_j^i$, a reply is given immediately. On the
other hand, if the query is from a neighbor $m$ in $S_j^i$, a test
is made to verify whether $SD_j^i$ increased beyond the previously
reported distance (line 28). If it did not, a reply is sent
immediately. However, if $SD_j^i$ increased, no reply is given and the
query is blocked by adding $m$ to $WN_j^i$. The replies to neighbors
in $WN_j^i$ are deferred until the node is ready to transit to PASSIVE
state. After receiving all replies, one of two things can happen:
either the ACTIVE phase ends or it continues. If the distance $D_j^i$
increased again after receipt of all replies, the ACTIVE phase is
extended by sending a new set of queries; otherwise the ACTIVE phase
ends. If the ACTIVE phase continues, no replies are issued to the
pending queries in $WN_j^i$. Otherwise, all replies are given and the
node transits to PASSIVE state, satisfying the PASSIVE-state invariant
$D_j^i = FD_j^i = RD_j^i = \min\{D_{jk}^i + l_k^i \mid k \in N^i\}$.

00. procedure ProcessDistVect(et, m, d, j)
01. {et is the type, m is the neighbor, d is the distance, j is the destination}
02. begin
03.   if (j = thisNode $\wedge$ et = QUERY) then send [REPLY, j, 0] to m; endif
04.   $D_{jm}^i \leftarrow d$;
05.   $D_j^i \leftarrow \min\{D_{jk}^i + l_k^i \mid k \in N^i\}$;
06.   $SD_j^i \leftarrow \min\{D_{jk}^i + l_k^i \mid k \in S_j^i\}$;
07.   if ($state_j^i$ = PASSIVE $\vee$ ($state_j^i$ = ACTIVE $\wedge$ last reply is received for j)) then
08.     $FD_j^i \leftarrow \min\{D_j^i, RD_j^i\}$;
09.     if ($D_j^i > RD_j^i$) then
10.       $state_j^i \leftarrow$ ACTIVE;
11.       if (et = QUERY) then
12.         $WN_j^i \leftarrow WN_j^i \cup \{m\}$;
13.       endif
14.       $RD_j^i \leftarrow SD_j^i$;
15.       $\forall k \in N^i$, send [QUERY, j, $RD_j^i$] to neighbor k;
16.     else
17.       $state_j^i \leftarrow$ PASSIVE;
18.       foreach $k \in N^i$ do
19.         if ($k \in WN_j^i \vee (k = m \wedge et = \text{QUERY})$) then send [REPLY, j, $D_j^i$] to k;
20.         else if ($RD_j^i \neq D_j^i$) send [UPDATE, j, $D_j^i$] to k;
21.         endif
22.       done
23.       $RD_j^i \leftarrow D_j^i$;
24.       $WN_j^i \leftarrow \emptyset$;
25.     endif
26.   else
27.     if (et = QUERY) then
28.       if ($m \in S_j^i \wedge SD_j^i > RD_j^i$) then $WN_j^i \leftarrow WN_j^i \cup \{m\}$;
29.       else send [REPLY, j, $RD_j^i$] to m;
30.       endif
31.     endif
32.   endif
33.   $S_j^i \leftarrow \{k \mid D_{jk}^i < FD_j^i\}$;
34. end

Fig. 1. Distance vector processing in MDVA.
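
To make the state transitions concrete, the following Python sketch
mirrors the structure of ProcessDistVect for a single node. It is a
reading of Figure 1 under stated assumptions, not a reference
implementation. The class and field names are hypothetical; the
`awaiting` set is one possible way to detect that "the last reply is
received", which the figure leaves unspecified; messages are simply
appended to an `outbox` list; a node does nothing further for its own
address after line 03; and the UPDATE on line 20 is taken to carry the
new distance $D_j^i$.

```python
import math
from dataclasses import dataclass, field

INF = math.inf


@dataclass
class Entry:                      # main-table row for one destination j
    D: float = INF
    FD: float = INF
    RD: float = INF
    SD: float = INF
    S: set = field(default_factory=set)
    WN: set = field(default_factory=set)
    active: bool = False
    awaiting: set = field(default_factory=set)   # assumption: replies still due


class Node:
    def __init__(self, name, link_cost):
        self.name = name
        self.l = dict(link_cost)               # link table: l[k]
        self.nbr = {k: {} for k in self.l}     # neighbor tables: nbr[k][j]
        self.table = {}
        self.outbox = []                       # (to, type, j, d) tuples

    def send(self, kind, j, d, k):
        self.outbox.append((k, kind, j, d))

    def process_dist_vect(self, et, m, d, j):
        if j == self.name:                                      # line 03
            if et == "QUERY":
                self.send("REPLY", j, 0, m)
            return
        e = self.table.setdefault(j, Entry())

        def via(k):
            return self.nbr[k].get(j, INF) + self.l[k]

        self.nbr[m][j] = d                                      # line 04
        e.D = min((via(k) for k in self.l), default=INF)        # line 05
        e.SD = min((via(k) for k in e.S), default=INF)          # line 06
        if et == "REPLY":
            e.awaiting.discard(m)
        last_reply = e.active and et == "REPLY" and not e.awaiting
        if not e.active or last_reply:                          # line 07
            e.FD = min(e.D, e.RD)                               # line 08
            if e.D > e.RD:                                      # line 09
                e.active = True
                if et == "QUERY":
                    e.WN.add(m)                                 # lines 11-12
                e.RD = e.SD                                     # line 14
                e.awaiting = set(self.l)
                for k in self.l:                                # line 15
                    self.send("QUERY", j, e.RD, k)
            else:
                e.active = False                                # line 17
                for k in self.l:                                # lines 18-22
                    if k in e.WN or (k == m and et == "QUERY"):
                        self.send("REPLY", j, e.D, k)
                    elif e.RD != e.D:
                        self.send("UPDATE", j, e.D, k)
                e.RD = e.D                                      # line 23
                e.WN = set()                                    # line 24
        elif et == "QUERY":                                     # lines 27-31
            if m in e.S and e.SD > e.RD:
                e.WN.add(m)
            else:
                self.send("REPLY", j, e.RD, m)
        e.S = {k for k in self.l if self.nbr[k].get(j, INF) < e.FD}  # line 33


# Node i with neighbors a and b (unit link costs), destination j.
node = Node("i", {"a": 1, "b": 1})
node.process_dist_vect("UPDATE", "a", 2, "j")    # learns a route via a
node.process_dist_vect("UPDATE", "b", 5, "j")    # longer alternative via b
node.process_dist_vect("UPDATE", "a", 10, "j")   # distance via a increases
mid = (node.table["j"].active, node.table["j"].FD, node.table["j"].RD)
node.process_dist_vect("REPLY", "a", 10, "j")
node.process_dist_vect("REPLY", "b", 5, "j")     # last reply: back to PASSIVE
e = node.table["j"]
print(mid, (e.active, e.D, e.FD, e.RD, e.S))
```

In this run the increase reported by `a` makes the node ACTIVE with
$FD_j^i$ held at 3 while it reports 11 in its queries; once both replies
arrive it returns to PASSIVE state with $D_j^i = FD_j^i = RD_j^i = 6$
and `b` as the only successor.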

IV.  Correctness  Proofs 

To prove the correctness of MDVA, consider the following two
mutually exclusive and exhaustive cases: (1) some link costs
change, but the distances to destinations either decrease or
remain unchanged; (2) some link costs increase, resulting in an
increase in distances to some destinations. MDVA works
identically to DBF when distances to destinations only decrease, and
the same proof as for DBF applies [2]. To state this formally, assume
the network is stable up to time $t$ and all nodes have the correct
distances. At time $t$, the costs of some links decrease. Since the
distances in the tables are such that $D_j^i(t) \ge \mathbf{D}_j^i(t)$,
within some finite time $t'$, $t \le t' < \infty$,
$D_j^i(t') = \mathbf{D}_j^i(t)$.

MDVA and DBF behave differently when some link costs
increase such that distances between some source-destination
pairs increase. In this case, $D_j^i(t) < \mathbf{D}_j^i(t)$ for some
$i$ and $j$. Both DBF and MDVA first increase $D_j^i$ to a value no
smaller than $\mathbf{D}_j^i(t)$, after which the distances
monotonically decrease until they converge to the correct distances.
MDVA and DBF, however, differ in how they increase the distances. DBF
does it step by step in small bounded increments until
$D_j^i \ge \mathbf{D}_j^i(t)$. However, when
$\mathbf{D}_j^i(t) = \infty$, this leads to the count-to-infinity
problem. In contrast, MDVA uses diffusing computations to quickly
raise $D_j^i$ so that $D_j^i \ge \mathbf{D}_j^i(t)$, after which it
functions as in case (1) described above, and the distances converge to
the correct values as before. After the end of all diffusing
computations, MDVA works just like DBF.

In summary, to show that MDVA terminates, it is sufficient
to show that: (1) the $SG_j$ are loop-free at every instant
(Theorem 2), (2) every diffusing computation completes within a finite
time (Theorem 3), and (3) there is a finite number of diffusing
computations (Theorem 4). Finally, we show in Theorem 5 that MDVA
converges to the correct distances when it terminates.

Theorem  2:  For  a  given  destination  j,  the  SGj  constructed 
by  MDVA  is  loop-free  at  every  instant. 

Proof: The proof is by showing that the LFI conditions
are satisfied during every ACTIVE and PASSIVE phase. Let
$t_n$ be the time when the $n$-th transition from PASSIVE to ACTIVE
state starts at node $i$ for $j$. The proof is by induction on $t_n$. At
node initialization time 0, all distance variables are initialized to
infinity and hence $FD_j^i(0) \le D_{ji}^k(0)$, $k \in N^i$. Assume the
LFI conditions are true up to time $t_n$. Then

    $FD_j^i(t) \le D_{ji}^k(t), \quad t \in [0, t_n)$.              (3)

At any time $t$, from lines 6, 8, 14 and 23 in the pseudocode in
Figure 1, and because $SD_j^i(t) \ge D_j^i(t)$, it follows that

    $FD_j^i(t) \le RD_j^i(t)$                                       (4)

and therefore, for $t_{n-1}$ and $t_n$ we have

    $FD_j^i(t_{n-1}) \le RD_j^i(t_{n-1})$,                          (5)

    $FD_j^i(t_n) \le RD_j^i(t_n)$.                                  (6)

Let the queries sent at the start time $t_n$ of the ACTIVE phase
be received at a particular neighbor $k$ at $t' > t_n$. From Eq. (4)
and the fact that the update messages sent, if any, between $t_{n-1}$
and $t_n$ specify non-increasing distances, we have

    $FD_j^i(t) \le D_{ji}^k(t), \quad t \in [t_n, t']$.             (7)

Let $t''$ be the time when all replies are received and the ACTIVE
phase ends. During the ACTIVE phase the value of $FD_j^i$ remains
unchanged and no new $RD_j^i$ is reported during this period (lines
27-31). Furthermore, during the PASSIVE phase, only decreasing
values of $RD_j^i$ are reported. Then from Eq. (6) it follows that

    $FD_j^i(t) \le D_{ji}^k(t), \quad t \in [t', t'']$.             (8)

At $t''$, irrespective of whether the node transits to PASSIVE state
or continues in the ACTIVE phase, from Eq. (4) we have

    $FD_j^i(t'') \le RD_j^i(t'')$.                                  (9)

In the case that the ACTIVE phase finally ends, we have
$FD_j^i(t) \le D_{ji}^k(t)$ for $t \in [t_n, t'']$. In the PASSIVE
phase, $RD_j^i$ can only remain constant or decrease until the next
ACTIVE phase at $t_{n+1}$. Therefore, the LFI conditions are satisfied
in the interval $[t_n, t_{n+1})$. On the other hand, if the ACTIVE phase
continues, new queries are sent at time $t''$. Assume all replies for
these queries are received at time $t'''$. From an argument similar to
the one above, it follows that $FD_j^i(t) \le D_{ji}^k(t)$ for
$t \in [t_n, t''']$. Thus, irrespective of how long the ACTIVE phase
continues, the invariant holds in $[t_n, t_{n+1}]$. By induction,
therefore, the LFI conditions hold at all times. It then follows from
Theorem 1 that $SG_j$ is loop-free at all times. ■

Theorem 3: Every ACTIVE phase has a finite duration.

Proof: An ACTIVE phase may fail to end for either of
two reasons: deadlock or livelock. First we show that a deadlock
cannot occur. A node that transits to ACTIVE state with respect
to a destination sends queries. If the transition is due to a query
from a successor, the node defers the reply to this query until
it receives the replies to its own queries. Because nodes wait
for replies to their queries before replying to a query, there is a
possibility of "circular" waits leading to a deadlock. But this
is impossible for the following reasons. First, a node in PASSIVE
state immediately replies to a query if the query does not increase
its distance to the destination (line 19). If the query is from a
successor and potentially increases $SD_j^i$, and the node is ACTIVE,
the query is held until the ACTIVE phase ends (line 28). Because
the $SG_j$'s are loop-free at every instant (Theorem 2), a
deadlock cannot occur. Thus, a node that issued queries to its
neighbors will eventually receive all the replies and transit to
PASSIVE state.

A livelock is a situation where a node endlessly has back-to-back
ACTIVE phases without ever replying to the pending
queries from the successors. A livelock cannot occur for the
following reasons. An ACTIVE phase transition occurs either
because the cost of an adjacent link increases or because a query
is received from a successor that increases $SD_j^i$. But we know
that a query from a successor is blocked if it increases $SD_j^i$.
Because links can change only a finite number of times and there
is only a finite number of neighbors for each node from which
the node can receive queries, the node can have only a finite
number of back-to-back ACTIVE phases. A node eventually sends all
pending replies and enters PASSIVE state. A livelock, therefore,
cannot occur. ■

Theorem 4: A node can have only a finite number of ACTIVE
phases.

Proof: Assume towards a contradiction that there is a node
that goes through an infinite number of PASSIVE-to-ACTIVE
transitions. An ACTIVE phase transition occurs either because of a
query from a successor or because of a link-cost increase of an
adjacent link. Because link costs can change only a finite number of
times, the infinite PASSIVE-ACTIVE phase transitions must have
been triggered by an infinite number of queries from a neighbor.
Let that neighbor be $k$. Now, by the same argument, $k$ is
sending an infinite number of queries because it is receiving an
infinite number of queries. But this argument cannot be continued
forever because there is only a finite number of nodes in the
network. Because the reply to the neighbor in the successor set
causing the phase transition is blocked and the routing graphs
are loop-free at every instant (Theorem 2), there must be a node
that transits to ACTIVE state only because of adjacent link-cost
changes. This implies that a link must change its cost an infinite
number of times, which contradicts the assumption. Therefore, a
node cannot have an infinite number of ACTIVE phases. ■

Theorem 5: After a finite sequence of link-cost changes in the
network, the distances $D^i_j$ converge to the final correct values.

Proof: Assume that at time 0 every node has correct distances
to all the destinations. In other words, $D^i_j(0) = \mathcal{D}^i_j(0)$,
where $\mathcal{D}^i_j(t)$ denotes the correct distance from $i$ to $j$
at time $t$. Assume that a finite number of link-cost changes, link
failures and link recoveries occur in the network between time 0 and
$t_c$, and that after $t_c$ no more changes occur. We have to show that
at some time $t_f$, such that $t_c < t_f < \infty$, all nodes will have
converged to the correct distances. That is,
$D^i_j(t_f) = \mathcal{D}^i_j(t_c) = \mathcal{D}^i_j(t_f)$.

From Theorems 3 and 4, it follows that within a finite time
after the last link change, all nodes transit to the PASSIVE state
and remain in the PASSIVE state thereafter. Therefore, let $t'$ be
the time when the last ACTIVE phase ends in the network. We prove
the following.

1. $D^i_j(t') \geq \mathcal{D}^i_j(t_c)$ for every $i$ and $j$.

2. Between $t'$ and $t_f$, all $D^i_j$'s monotonically decrease and
eventually converge to the correct distances $\mathcal{D}^i_j(t_c)$ at $t_f$.
That is, $D^i_j(t_f) = \mathcal{D}^i_j(t_c)$.

Part 1: Assume towards a contradiction that
$D^i_j(t') < \mathcal{D}^i_j(t_c)$. Let
$D^i_j(t') = l^i_k(t') + D^i_{jk}(t')$ for some $k \in K \subseteq N^i$.
Assume that $D^k_j(t') \geq \mathcal{D}^k_j(t_c)$. Also assume that $K$
has only one element. Because
$\mathcal{D}^i_j(t_c) = l^i_k(t_c) + \mathcal{D}^k_j(t_c)$, we have
$l^i_k(t') + D^i_{jk}(t') < l^i_k(t_c) + D^k_j(t')$, from which we can
infer that either $l^i_k(t') < l^i_k(t_c)$, or
$D^i_{jk}(t') < D^k_j(t')$, or both. If $l^i_k(t') < l^i_k(t_c)$, it
implies that the cost of link $(i, k)$ has not yet been increased to
$l^i_k(t_c)$ by a link-cost change event. When it is, the condition on
line 9 becomes true and a transition to the ACTIVE state is
triggered, so not all ACTIVE phases have ended. Similarly, if
$D^i_{jk}(t') < D^k_j(t')$, then there is a message in transit which,
when processed by $i$, would trigger a PASSIVE-to-ACTIVE transition.
This means that the ACTIVE phases have not yet ended, which
contradicts the assumption. Therefore, when the ACTIVE phases
end, $D^i_j(t') \geq \mathcal{D}^i_j(t_c)$. When $K$ has more than one
element, each element will be removed from the successor set one
after the other without triggering the ACTIVE transition until the
last element, when the ACTIVE state transition finally occurs.

Part 2: After every node becomes PASSIVE at time $t'$, all the
messages in transit can only decrease the distances; otherwise,
they would cause a transition to the ACTIVE state. At this stage
MDVA works essentially like DBF, and the same proof as for DBF
applies here. Each time a distance is decreased, the new distance
is reported. Because distances cannot decrease forever and are
bounded from below by $\mathcal{D}^i_j(t_c)$, the distances will
eventually converge to the correct distances $\mathcal{D}^i_j(t_c)$. ■
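The monotone-decrease argument of Part 2 can be illustrated with a minimal sketch. It is not MDVA itself: it only assumes that every node starts from an estimate that is at least the correct distance and accepts only updates that lower its estimate, as a PASSIVE node would. The function name, the synchronous rounds and the four-node topology are hypothetical choices made for illustration.

```python
def converge_from_overestimates(links, dist, dest):
    """Apply D_i <- min(D_i, l_ik + D_k) in synchronous rounds until stable.

    links maps a directed pair (i, k) to a positive link cost; dist holds
    each node's current estimate of its distance to dest.
    """
    dist = dict(dist)
    history = [dict(dist)]
    changed = True
    while changed:
        changed = False
        new = dict(dist)
        for (i, k), cost in links.items():
            # Only accept updates that lower the estimate.
            if i != dest and cost + dist[k] < new[i]:
                new[i] = cost + dist[k]
                changed = True
        dist = new
        history.append(dict(dist))
    return dist, history


# Hypothetical example: undirected links, listed in both directions.
undirected = {("a", "b"): 1, ("b", "j"): 1, ("a", "j"): 5, ("c", "a"): 2}
links = {}
for (u, v), c in undirected.items():
    links[(u, v)] = c
    links[(v, u)] = c

# Starting estimates are over-estimates of the correct distances to j.
start = {"j": 0, "a": 5, "b": 3, "c": 10}
final, history = converge_from_overestimates(links, start, "j")
```

In this sketch every entry of `history` is pointwise no larger than the previous one, and the loop stops once no estimate can be lowered any further, which mirrors the lower bound used in the proof.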

V. Performance Analysis

The storage complexity is $O(|N^i||N|)$, as each of the
$|N^i|$ neighbor tables and the main distance table has
$O(|N|)$ entries. The computation complexity is the time
taken to process a distance vector, and it is easy to see that
processDistVector() takes $O(|N^i|)$ time. The time complexity is
the time it takes for the network to converge after a set of
link-cost changes in the network, and the communication complexity
is the amount of message overhead required for propagating a
set of link-cost changes. In a dynamic environment, the timing
and range of link-cost changes follow complex patterns that are
often determined by the traffic on the network, because of which
obtaining closed-form expressions for the time and communication
complexity is impossible. The approximate analysis provided
in [8] for the case in which communication is synchronous
throughout the network also applies to MDVA.

Fig. 2. Example topology
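As a rough illustration of the storage and computation bounds above, the following sketch counts table entries and performs the per-destination minimum over neighbors. It is not the paper's processDistVector(); both function names are hypothetical, and reading the $O(|N^i|)$ bound as the cost of one pass over the neighbors for a single destination is an assumption.

```python
def storage_entries(num_neighbors, num_nodes):
    # One table per neighbor plus the main distance table,
    # each holding one entry per destination.
    return (num_neighbors + 1) * num_nodes


def shortest_via_neighbors(link_cost, neighbor_dist):
    # A single pass over the neighbors of a node for one destination:
    # link_cost[k] is the cost of the link to neighbor k and
    # neighbor_dist[k] is the distance reported by k.
    return min(link_cost[k] + neighbor_dist[k] for k in link_cost)


entries = storage_entries(num_neighbors=3, num_nodes=10)
best = shortest_via_neighbors({"x": 1, "y": 4}, {"x": 7, "y": 2})
```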

We use simulations to compare the control overhead and
convergence time of MDVA with those of DBF, MPDA [15] and
topology broadcast (TOPB). The main purpose of these simulations
is to give a qualitative explanation of the behavior of MDVA.
DBF and TOPB were chosen because DBF is based on vectors of
distances and does not use diffusing computations, while TOPB
represents an ideal upper bound on the performance of the widely
used routing protocols OSPF and IS-IS. MPDA was chosen because
it has been shown to be very efficient compared to TOPB in terms
of communication overhead. MDVA achieves loop-freedom through
diffusing computations that, in some cases, may span the whole
network. In contrast, MPDA uses only neighbor-to-neighbor
synchronization. It is interesting to see how convergence times
and control message overheads are affected by the synchronization
mechanisms. A comparison of several algorithms that does not
include MPDA and MDVA is given in [3].

Simulations are performed on the topology shown in Fig. 2.
The simulator used is an event-driven real-time simulator called
CPT^1.

We assume the computation time to be negligible compared
to the communication times. The bandwidth and propagation
delay of each link are 5 MB and 100 μs, respectively. In backbone
networks, links and nodes are highly reliable and change
status much less frequently than link costs, which are a function
of the traffic on the link. This is particularly true in the
near-optimal routing of [15], in which the link costs are
periodically measured and reported. For this reason, in this paper
we focus on comparing the algorithms in scenarios in which multiple
link-cost changes occur.

In each experiment, all links are initially set at unit cost and
then each link cost is changed by an amount determined by the
formula $\alpha k + \beta r$, where $r$ is a uniform random value in
[0, 1]. The parameters $\alpha$ and $\beta$ of the experiment are real
values, while $k$ is a positive integer. After setting the new link
costs, the convergence times and message overheads are measured for
each routing algorithm. For each experiment with specific $\alpha$, $k$
and $\beta$, several trials are made using different random values for
$r$. The averages and probability distributions obtained for each
metric and for each set of trials are compared.

^1 We thank Nokia Wireless Routers for allowing us to use the C++
Protocol Toolkit.

Fig. 3. Average convergence times, $\alpha = 1$, $\beta = 5$.

Fig. 4. Average message overhead, $\alpha = 1$, $\beta = 5$.

Fig. 5. Average convergence times, $\alpha = -0.1$, $\beta = -0.4$.

Fig. 6. Average message overhead, $\alpha = -0.1$, $\beta = -0.4$.

Fig. 7. PDF of convergence times, $\alpha = 1$, $\beta = 5$, $k = 1$.

Fig. 8. PDF of message overhead, $\alpha = 1$, $\beta = 5$, $k = 1$.
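The link-cost perturbation used in the experiments can be sketched as follows. This is only an illustration under stated assumptions: the amount $\alpha k + \beta r$ is taken to be added to the initial unit cost, costs are clamped at zero so that they never become negative, and the function name, the seeded generator and the three-link list are hypothetical.

```python
import random


def perturbed_costs(links, alpha, beta, k, rng):
    """Return a new cost for every link, starting from unit cost."""
    costs = {}
    for link in links:
        r = rng.random()                 # uniform in [0, 1)
        change = alpha * k + beta * r    # amount given by the formula
        # Assumption: the change is added to the unit cost, and the
        # result is clamped so that a cost never becomes negative.
        costs[link] = max(0.0, 1.0 + change)
    return costs


rng = random.Random(0)                   # fixed seed for repeatable trials
links = [("a", "b"), ("b", "c"), ("a", "c")]
increased = perturbed_costs(links, alpha=1, beta=5, k=2, rng=rng)
decreased = perturbed_costs(links, alpha=0, beta=-0.4, k=1, rng=rng)
```

Repeating the call with a different seed for each trial would give the several trials per parameter setting described above.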

Fig. 3 and Fig. 4 show the average convergence time and
average message load, measured over several trials, when the link
costs are increased from the initial unit cost to a cost determined
by the formula $\alpha k + \beta r$ with $\alpha = 1$, $\beta = 5$ and
$k = 0, 1, 2, 4$. As can be observed in Fig. 3, the average convergence
times are best for MDVA. As can be seen in Fig. 4, the average message
loads of MDVA are also low, and only MPDA has lower message overhead.
Fig. 5 and Fig. 6 show the averages when link costs decrease.
Observe that DBF and MDVA perform identically in this case.

Fig. 7 and Fig. 8 show the complete distributions of convergence
times and message overhead for the case $k = 1$, $\alpha = 1$,
$\beta = 5$. Observe that the distributions are quite uniform
compared to those of DBF. When $k$ is increased from 1 to 5, the
convergence times and message overheads of MDVA, as shown in Fig. 9
and Fig. 10, do not change much, but the performance of DBF
degrades considerably. This is because of the counting-to-infinity
problem, which does not occur in MDVA.

Fig. 9. PDF of convergence times, $\alpha = 5$, $\beta = 5$, $k = 1$.

Fig. 10. PDF of message overhead, $\alpha = 5$, $\beta = 5$, $k = 1$.

Fig. 11. PDF of convergence times, $\alpha = 0$, $\beta = -0.4$.
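The counting-to-infinity behavior that degrades DBF can be reproduced with a toy example. The sketch below is not the simulator used in the paper: it assumes a three-node line a - b - j with unit link costs, synchronous exchange of distances, no split horizon, and a hypothetical cap of 16 standing in for infinity.

```python
INF = 16  # hypothetical cap that stands in for "infinity"


def count_to_infinity():
    """Trace the distances of a and b to j after link (b, j) fails."""
    # Before the failure: b reaches j directly, a reaches j through b.
    dist = {"a": 2, "b": 1}
    trace = [dict(dist)]
    while True:
        # After the failure each node has only the other as a neighbor,
        # so each one adds the unit link cost to the other's last report.
        new = {
            "a": min(INF, 1 + dist["b"]),
            "b": min(INF, 1 + dist["a"]),
        }
        trace.append(new)
        if new == dist:
            break
        dist = new
    return trace


trace = count_to_infinity()
```

In this sketch the two nodes keep raising their distances in small steps, one round after another, until both reach the cap. This is why convergence after a cost increase or failure is slow in DBF.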

Fig. 11 and Fig. 12 show the convergence time and message
overhead distributions when link costs decrease ($\alpha = 0$,
$\beta = -0.4$). (Note that we make sure that link costs do not
become negative.) Observe that the performance of MDVA and that of
DBF are much the same, because MDVA essentially functions like DBF
when distances to destinations decrease. From these simulations it
appears that MDVA is a good choice if low convergence times are
desired at the expense of higher message overhead, while MPDA is
preferable if low message overhead is more important than low
convergence times.

Fig. 12. PDF of message overhead, $\alpha = 0$, $\beta = -0.4$.


VI.  Summary 

This paper presented a new distributed distance-vector routing
algorithm, MDVA, which is free from the count-to-infinity
problem and provides multiple next-hop choices for each
destination such that the routing graphs implied by them are always
loop-free. The novelty of the algorithm lies in its design around a
set of loop-free invariant conditions, which ensure instantaneous
loop-freedom and correct termination of the protocol. Formal
proofs are presented to show MDVA's convergence, correctness
and loop-freedom. Through simulation we have compared it to
some currently used routing protocols.